Belgian patent BE1023435B1: Method and system for post-processing a speech recognition result

Abstract:
A method of post-processing a voice recognition result (100), said result (100) comprising a start (111), an end (112) and a plurality of elements (113), said method comprising the following steps: reading said result (100); selecting one of its elements (113); determining whether it is valid; repeating the steps of selecting an element (113) and determining whether it is valid; if at least one element (113) has been determined to be valid, determining a post-processed solution (200) by taking up at least one such valid element (113). The method of the invention is characterized in that the elements (113) are selected consecutively from said end (112) toward said start (111) of the result (100).
Publication number: BE1023435B1
Application number: E2016/5152
Filing date: 2016-03-02
Publication date: 2017-03-20
Inventor: Jean-Luc Forster
Applicant: Zetes Industries SA
Main IPC class:

Description:

Method and system for post-processing a speech recognition result
Field of the invention
[0001] In a first aspect, the invention relates to a method of post-processing a speech recognition result. According to a second aspect, the invention relates to a system (or device) for post-processing a voice recognition result. According to a third aspect, the invention relates to a program. According to a fourth aspect, the invention relates to a storage medium comprising instructions (for example: USB key, CD-ROM type disk or DVD).
State of the art
[0002] A speech recognition engine makes it possible to generate, from a spoken or audio message, a result that is generally in the form of a text or code that can be used by a machine. This technology is now widely used and is considered very useful. Various applications of voice recognition are taught in US6,754,629B1.
[0003] There are studies to improve the results provided by a voice recognition engine. For example, US2014/0278418A1 proposes to take advantage of the identity of a speaker to adapt the speech recognition algorithms of a speech recognition engine accordingly. This adaptation of the algorithms is therefore done within the voice recognition engine, for example by modifying its phonetic dictionary to take into account the manner in which the speaker or user speaks.
[0004] A speech recognition result generally comprises a series of elements, for example words, separated by silences. The result is characterized by a beginning and an end and its elements are temporally arranged between this beginning and this end.
A result provided by a voice recognition engine can be used, for example, to enter information into a computer system, such as an article number or any instruction to be carried out. Rather than being used raw, a speech recognition result sometimes undergoes one or more post-processing operations to extract a post-processed solution. For example, it is possible to scan a speech recognition result from beginning to end and to retain, for example, the first five elements considered valid, if it is known that the useful information does not include more than five elements (an element is for example a word). Indeed, knowing that the useful information (a code for example) does not include more than five words (five digits for example), it is then sometimes decided to retain only the first five valid elements of a speech recognition result. Any additional later element is considered superfluous with respect to the expected information and is therefore considered invalid.
Such a post-processing method does not always provide acceptable solutions. The inventors have found that such a method can generate a false post-processed solution in some cases, that is to say a solution that does not correspond to the information actually intended by the speaker. This post-processing method is therefore not reliable enough.
SUMMARY OF THE INVENTION
[0007] According to a first aspect, one of the aims of the invention is to provide a more reliable method of post-processing a speech recognition result. For this purpose, the inventors propose the following method. A method of post-processing a speech recognition result, said result comprising a beginning, an end and a plurality of elements distributed between said beginning and said end, said post-processing method comprising the following steps:
i. receiving said result;
ii. isolating (or considering, selecting) an element of said plurality of elements that has not yet undergone the validation test of step iii.a.;
iii. a. if an element was isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.;
iv. repeating steps ii. and iii. (in the following order: step ii., then step iii.);
v. if at least one element has been determined valid in step iii.a., determining a post-processed solution using (or taking up) at least one element determined valid in step iii.a.;
characterized in that each element isolated in step ii. is chosen consecutively (that is, without interruption and without skipping an element) from said end of the result toward said beginning of the result.
With the method of the invention, a voice recognition result is traversed from the end to the beginning. The inventors have discovered that a person dictating a message to a voice recognition engine is more likely to hesitate and/or make a mistake at the beginning than at the end. By processing a speech recognition result starting from the end rather than from the beginning, the method of the invention favors the part of the result that is most likely to contain the correct information. This method is therefore more reliable.
Let's take the following example. Imagine that a code to read is: 4531. The operator, reading it, says: "5, 4, uh, 4, 5, 3, 1". Typically, a speech engine will provide either "5, 4, 2, 4, 5, 3, 1" or "5, 4, 4, 5, 3, 1" as a result. In the first case, "uh" is associated with two, in the second case, the engine does not provide a result for "uh". Let's assume that a post-processing system (which can be built into a speech engine) knows that the result should not have more than four good elements (numbers in this case). A post-processing system that traverses the result from the beginning to the end of the result will provide as a post-processed solution: 5424 or 5445 (and not 4531). The method of the invention will provide 4531, i.e. the right solution.
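The example above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the function names `take_first_valid` and `take_last_valid` are hypothetical, elements are plain strings, and "valid" here simply means "is a digit", with at most four elements retained.

```python
def take_first_valid(elements, max_len, is_valid):
    """Forward scan: keep the first max_len valid elements."""
    kept = [e for e in elements if is_valid(e)]
    return kept[:max_len]


def take_last_valid(elements, max_len, is_valid):
    """Backward scan: walk from the end of the result toward its beginning."""
    kept = []
    for e in reversed(elements):
        if len(kept) == max_len:
            break
        if is_valid(e):
            kept.append(e)
    kept.reverse()  # restore chronological order
    return kept


is_digit = str.isdigit  # assumed validity rule for this example

uh_heard_as_two = ["5", "4", "2", "4", "5", "3", "1"]
uh_dropped = ["5", "4", "4", "5", "3", "1"]

forward_a = "".join(take_first_valid(uh_heard_as_two, 4, is_digit))
forward_b = "".join(take_first_valid(uh_dropped, 4, is_digit))
backward_a = "".join(take_last_valid(uh_heard_as_two, 4, is_digit))
backward_b = "".join(take_last_valid(uh_dropped, 4, is_digit))

print(forward_a, forward_b, backward_a, backward_b)
```

Both forward scans return a wrong code, while the backward scan recovers 4531 in both cases, which matches the behaviour described in the text.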
The inventors have noticed that the situation illustrated by this example, that is to say an operator hesitating or making a mistake at the beginning rather than at the end of the recorded sequence, is more frequent than the reverse. Thus, overall, the method of the invention is more reliable because it yields fewer incorrect results. The chances of obtaining a correct post-processed solution are also higher with the method of the invention. It is therefore also more effective.
The method of the invention has other advantages. It is easy to implement. In particular, it does not require many implementation steps. The implementation steps are also simple. These aspects facilitate the integration of the method of the invention, for example at the level of a computer system using a voice recognition result, or at the level of a voice recognition engine, for example.
The post-processing method of the invention can be seen as a method of filtering a speech recognition result: indeed, the invalid elements are not used to determine the post-processed solution.
A speech recognition result is generally in the form of a text or code that can be used by a machine. An element of a result represents information of the result delimited by two different times along a time scale, t, associated with the result, and which is not considered as a silence or a background noise. Generally, an element is a group of phonemes. A phoneme is known to those skilled in the art. Preferably, an element is a word. An element can also be a group or a combination of words. An example of a combination of words is 'cancel operation'.
In the context of the invention, a voice recognition result may be of different types. According to a first possible example, a speech recognition result represents a hypothesis provided by a speech recognition engine from a message said by a user or speaker. In general, a speech recognition engine provides several (for example, three) hypotheses from a message said by a user.
In this case, it usually also provides a score (which can be expressed in different units depending on the type of speech recognition engine) for each hypothesis. Preferably, the post-processing method of the invention then comprises a preliminary step of selecting only the hypothesis or hypotheses having a score greater than or equal to a predetermined score. For example, if the voice recognition engine used is Nuance's VoCon® 3200 V3.14 model, said predetermined score is 4000. The steps described above (steps i, ii, iii, iv, v) are then applied only to results having a score greater than or equal to said predetermined score.
According to another possible example, a voice recognition result is a solution, generally comprising a plurality of elements, obtained from one or more post-processing operations applied to one or more hypotheses provided by a voice recognition engine. In this last example, the voice recognition result therefore comes from a voice recognition module and from one or more post-processing modules applied to one or more hypotheses provided by a voice recognition engine.
If no element has been determined valid in step iii.a, step v preferably comprises a substep of providing another post-processed solution. Preferably, this other post-processed solution is a post-processed solution which does not include any element of said result. In this preferred variant and when no element has been determined valid in step iii.a, various examples of post-processed solution are: an empty message, that is to say one not including any element (no word, for example); a message stating that the post-processing was unsuccessful. According to another possible variant, this other post-processed solution corresponds to the voice recognition result if no element has been determined valid in step iii.a (no filtering of the result).
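The preliminary selection step might look like the sketch below. The `Hypothesis` class, the `select_hypotheses` function and the sample scores are hypothetical names and values chosen for illustration; only the threshold of 4000 comes from the text, and it applies to the engine model named there.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    score: int  # engine-specific unit (assumed integer here)


def select_hypotheses(hypotheses, min_score=4000):
    """Keep only hypotheses whose score reaches the predetermined score."""
    return [h for h in hypotheses if h.score >= min_score]


candidates = [
    Hypothesis("4 5 3 1", 5200),
    Hypothesis("4 5 3 9", 4000),  # exactly at the threshold: kept
    Hypothesis("4 9 3 1", 3100),  # below the threshold: dropped
]
selected = select_hypotheses(candidates)
print([h.text for h in selected])
```

Steps i to v would then be run only on the hypotheses in `selected`.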
Along a time scale t associated with the result (see Figures 1 and 2 for example), the beginning of the result is prior to the end of the result.
[0018] Preferably, an element is a word. Examples of words are: one, two, car, umbrella. According to this preferred variant, the method of the invention gives even better results. Each word is determined from a message said by a user by a speech recognition engine using a dictionary. Grammar rules may reduce the choice of possible words from a dictionary.
[0019] Preferably, step iii.a. further comprises an instruction to go directly to step v. if the element undergoing the validation test of step iii.a is not determined to be valid. According to this preferred variant, a post-processed solution for which at least one element has been determined valid in step iii.a comprises only consecutive valid elements of the result of the speech recognition engine. The reliability of the method is then further improved because only a series of valid consecutive elements are kept.
[0020] Preferably, the method of the invention further comprises the following step: vi. determining whether said post-processed solution of step v. satisfies a grammar rule. By using a grammar rule, the reliability of the method of the invention can be further increased. In particular, an aberrant result is filtered out more effectively. An example of a grammar rule is a range of allowed word counts for the post-processed solution. For example, one could define as grammar rule: the post-processed solution must contain between three and six words.
Preferably, when a grammar rule is used, the method of the invention further comprises the following step: vii. a. if the answer to the test of step vi. is positive, providing said post-processed solution, b. otherwise, providing said voice recognition result.
According to another possible variant, the method of the invention comprises the following step when a grammar rule is used: vii. a. if the answer to the test of step vi. is positive (that is, the post-processed solution satisfies the grammar rule), providing the post-processed solution, b. if the answer to the test of step vi. is negative (that is, the post-processed solution does not satisfy the grammar rule), not providing a post-processed solution, or providing a blank message, or providing a message stating that no satisfactory post-processed solution could be determined.
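Steps vi. and vii. can be sketched as follows, using the word-count rule given as an example (between three and six words). The names `satisfies_word_count` and `finalize`, and the `fallback` parameter that switches between the two variants of step vii.b., are assumptions made for this illustration.

```python
def satisfies_word_count(solution, min_words=3, max_words=6):
    """Step vi.: example grammar rule on the number of words."""
    return min_words <= len(solution) <= max_words


def finalize(solution, raw_result, fallback="raw"):
    """Step vii.: provide the solution if the rule holds, else fall back.

    fallback="raw"   -> provide the unfiltered recognition result
    fallback="empty" -> provide an empty message
    """
    if satisfies_word_count(solution):
        return solution
    if fallback == "raw":
        return raw_result
    return []


raw = ["five", "four", "two", "four", "five", "three", "one"]
ok = finalize(["four", "five", "three", "one"], raw)
too_short_raw = finalize(["one"], raw, fallback="raw")
too_short_empty = finalize(["one"], raw, fallback="empty")
print(ok, too_short_raw, too_short_empty)
```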
It is possible to design different validation tests for step iii.a. For example, the validation test of step iii.a. may include a step of considering an element valid if its duration is greater than or equal to a lower duration threshold. Each element of the result corresponds to a duration or time interval which is generally provided by the speech recognition engine. With this preferred embodiment, it is possible to reject more effectively elements that are short-lived, such as a parasitic noise that may come from a machine.
In another example, the validation test of step iii.a. includes a step of considering an element valid if its duration is less than or equal to an upper duration threshold. With this preferred embodiment, it is possible to reject more effectively elements which are of long duration, for example a hesitation of a speaker who says 'uh' but for which the voice recognition engine provides the word 'two' (for example because it uses a predefined grammar rule that requires it to supply only digits). By using this preferred embodiment, it will be easier to eliminate the invalid word 'two'.
In another example, said validation test of step iii.a. includes a step of considering an element valid if its confidence rate is greater than or equal to a minimum confidence level.
The reliability of the method is further increased in this case.
In another example, said validation test of step iii.a. comprises a step of considering an element valid if a time interval separating it from another directly adjacent element towards said end of the result is greater than or equal to a minimum time interval.
With this preferred embodiment, it is possible to reject more efficiently elements that are not generated by a human being but rather by a machine for example and which are temporally very close together.
[0025] Preferably, said validation test of step iii.a. comprises a step of considering an element valid if a time interval separating it from another directly adjacent element towards said end of the result is less than or equal to a maximum time interval. With this variant, it is possible to more effectively reject elements that are temporally greatly separated from each other.
According to another possible variant of the method of the invention, the validation test of step iii.a. comprises a step of considering an element valid if a time interval separating it from another directly adjacent element towards said beginning of the result is greater than a minimum (time) interval.
According to another possible variant of the method of the invention, the validation test of step iii.a. comprises a step of considering an element valid if a time interval separating it from another directly adjacent element towards said beginning of the result is less than a maximum (time) interval.
[0028] Preferably, said validation test of step iii.a. comprises a step of considering, for a given speaker, an element of said result valid if a statistic associated with this element is, within an interval, consistent with a predetermined statistic for the same element and for that given speaker.
The statistic (or speech recognition statistic) associated with said element is generally provided by the speech recognition engine. Examples of statistics associated with an element are: the duration of the element, its confidence rate. Other examples are possible. It is possible to record such statistics for different elements and for different speakers (or operators), for example during a preliminary enrollment step. If one then knows the identity of the speaker who recorded the message to which a result provided by a voice recognition engine corresponds, it is possible to compare statistics associated with the different elements of said result with pre-established statistics for these elements and for this speaker. In this case, the method of the invention therefore preferably comprises an additional step for determining the identity of the speaker.
With this preferred embodiment, the reliability and efficiency are further increased because it is possible to take into account the vocal specificities of the speaker.
Preferably, all the elements determined valid in step iii.a are used to determine said post-processed solution in step v.
The inventors also propose an optimization method for providing an optimized solution from a first and a second voice recognition result, comprising the following steps: A. applying a post-processing method according to any one of the preceding claims to said first result; B. applying a post-processing method according to any one of the preceding claims to said second result; C. determining said optimized solution from one or more elements belonging to one or more of said first and second results and which have been determined valid by the validation test of step iii.a.
According to a second aspect, the invention relates to a system (or device) for post-processing a speech recognition result, said result comprising a beginning, an end and a plurality of elements distributed between said beginning and said end, said post-processing system comprising:
- acquisition means for reading said result;
- processing means:
+ to perform the following steps recursively:
• isolate an element of said plurality of elements that has not previously undergone a validation test imposed by said processing means,
• determine whether the isolated element is valid, using a validation test, and
+ to determine a post-processed solution by taking up at least one element determined valid;
characterized in that each element isolated by the processing means is selected consecutively from said end of the result toward said beginning of the result.
The advantages associated with the method according to the first aspect of the invention apply to the system of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable post-processed solution with the system of the invention. It is also possible to have a more efficient system to provide a correct post-processed solution. The various embodiments presented for the method according to the first aspect of the invention, apply to the system of the invention, mutatis mutandis.
According to a third aspect, the invention relates to a program (preferably a computer program) for processing a speech recognition result, said result comprising a beginning, an end and a plurality of elements distributed between said beginning and said end, said program comprising code to enable a device (e.g. a speech recognition engine, a computer capable of communicating with a voice recognition engine) to perform the following steps:
i. reading said voice recognition result;
ii. isolating an element of said plurality of elements that has not yet undergone the validation test of step iii.a.;
iii. a. if an element was isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.;
iv. repeating steps ii. and iii.;
v. if at least one element has been determined valid in step iii.a., determining a post-processed solution by taking up at least one element determined valid in step iii.a.;
characterized in that each element isolated in step ii. is selected consecutively from said end of the result toward said beginning of the result.
The advantages associated with the method and the system according to the first and second aspects of the invention, apply to the program of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable post-processed solution with the program of the invention. It is also possible to have a more efficient program to determine a correct post-processed solution. The various embodiments presented for the method according to the first aspect of the invention, apply to the program of the invention, mutatis mutandis.
If no element has been determined valid in step iii.a, step v preferably comprises the following sub-step: determining a post-processed solution that does not include any element of said result. In this preferred variant and when no element has been determined valid in step iii.a, different examples of post-processed solution are then: an empty message, that is to say one not including any element (no word, for example); a message stating that the post-processing was unsuccessful; the result provided by the voice recognition engine.
According to a fourth aspect, the invention relates to a storage medium (or recording medium) that can be connected to a device (for example, a voice recognition engine, a computer that can communicate with a voice recognition engine) and comprising instructions which, when read, enable said device to process a voice recognition result, said result comprising a beginning, an end and a plurality of elements distributed between said beginning and said end, said instructions causing said device to perform the following steps:
i. reading said result;
ii. isolating an element of said plurality of elements that has not yet undergone the validation test of step iii.a.;
iii. a. if an element was isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.;
iv. repeating steps ii. and iii.;
v. if at least one element has been determined valid in step iii.a., determining a post-processed solution by taking up at least one element determined valid in step iii.a.;
characterized in that each element isolated in step ii. is selected consecutively from said end of the result toward said beginning of the result.
The advantages associated with the method and the system according to the first and second aspects of the invention, apply to the storage medium of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable post-processed solution. It is also possible to more efficiently determine a correct post-processed solution. The various embodiments presented for the method according to the first aspect of the invention, apply to the storage medium of the invention, mutatis mutandis.
If no element has been determined valid in step iii.a, step v preferably comprises the following sub-step: determining a post-processed solution that does not include any element of said result. In this preferred variant and when no element has been determined valid in step iii.a, different examples of post-processed solution are then: an empty message, that is to say one not including any element (no word, for example); a message stating that the post-processing was unsuccessful; the result provided by the voice recognition engine.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] These aspects as well as other aspects of the invention will be clarified in the detailed description of particular embodiments of the invention, reference being made to the drawings, in which:
- Fig.1 schematically shows a speaker saying a message that is processed by a speech recognition engine;
- Fig.2 schematically shows an example of a speech recognition result;
- Fig.3 schematically shows different steps of a preferred variant of the method of the invention and their interaction;
- Fig.4 schematically shows an example of a post-processing system according to the invention.
The drawings are not to scale. Generally, similar elements are denoted by similar references in the figures. The presence of reference numbers in the drawings cannot be considered as limiting, even when these numbers are indicated in the claims.
DETAILED DESCRIPTION OF PARTICULAR EMBODIMENTS
[0040] FIG. 1 shows a speaker 40 (or user 40) saying a message 50 into a microphone 5. This message 50 is then transferred to a voice recognition engine 10, which is known to a person skilled in the art. Different models and different brands are available on the market. In general, the microphone 5 is part of the speech recognition engine 10. The latter processes the message 50 with speech recognition algorithms, based for example on a hidden Markov model (HMM). This yields a voice recognition result 100. An example of result 100 is a hypothesis generated by the voice recognition engine 10. Another example of result 100 is a solution obtained from speech recognition algorithms and from post-processing operations which are for example applied to one or more hypotheses generated by the voice recognition engine 10. Post-processing modules for providing such a solution may be part of the speech recognition engine 10. The result 100 is generally in the form of a text that can be deciphered by a machine, a computer or a processing unit for example. The result 100 is characterized by a beginning 111 and an end 112. The beginning 111 is prior to said end 112 along a time scale, t. The result 100 comprises a plurality of elements 113 temporally distributed between the beginning 111 and the end 112. An element 113 represents information between two different times along the time scale, t. In general, the various elements 113 are separated by portions of the result 100 representing a silence, a background noise, or a time interval during which no element 113 (word for example) is recognized by the speech recognition engine 10.
The method of the invention relates to the post-processing of a voice recognition result 100. In other words, the input of the method of the invention corresponds to a result 100 obtained from speech recognition algorithms applied to a message 50 said by a speaker 40 (or user 40). FIG. 2 shows a voice recognition result 100. Between its beginning 111 and its end 112, the result 100 comprises several elements 113, seven in the case illustrated in FIG. 2. In this figure, the elements 113 are represented as a function of time, t (abscissa). The ordinate, C, represents a level or rate of confidence. This concept is known to a person skilled in the art. It is a property or statistic generally associated with each element 113 and can, in general, be provided by a speech recognition engine 10. A confidence rate is, in general, a probability that an element of the speech recognition result, determined by a voice recognition engine 10 from a spoken element, is the correct one. An example of a voice recognition engine is the VoCon® 3200 V3.14 model from Nuance. In this case, the confidence rate varies between 0 and 10 000. A value of 0 is the minimum value of a confidence rate (very low probability that the element of the speech recognition result is the correct one) and 10 000 is the maximum value of a confidence rate (very high probability that the element of the speech recognition result is the correct one). The height of an element 113 in FIG. 2 indicates its confidence level 160.
The first step of the method of the invention, step i., consists in receiving the result 100. Then, starting from the end 112, the method isolates a first element 113. The method of the invention therefore first isolates the last element 113 of the result along the time scale, t. Once this element 113 has been chosen, the method determines whether it is valid using a validation test. Different examples of validation tests are presented below. The method then moves to the second element 113 starting from the end 112, and so on. According to a possible version of the method of the invention, all the elements 113 of the result 100 are thus traversed along the arrow shown at the top of FIG. 2, and the traversal stops once the first element 113 along the time scale, t, has been determined valid or not. According to another preferred variant, the traversal of the elements 113 of the result 100 along the arrow at the top of FIG. 2 stops as soon as an element 113 has been found to be invalid. A post-processed solution 200 is then determined by taking up elements 113 which have been determined to be valid, preferably all elements 113 that have been determined valid. When determining the post-processed solution 200, the correct order of the various selected elements 113 along the time scale, t, must be preserved. Thus, it should be taken into account that the first element 113 processed by the method of the invention is the last element 113 of the result 100 and therefore must come last in the post-processed solution 200 if it has been determined valid. In general, a voice recognition engine 10 provides, with the different elements 113 of the result 100, associated time information, for example the beginning and the end of each element 113. This associated time information can be used to put the elements determined valid in step iii.a. in the correct order, that is to say in increasing chronological order.
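The traversal described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Element` fields, the `post_process` name, the sample timings and the use of a confidence threshold of 4000 as the validation test are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    word: str
    start_ms: int
    end_ms: int
    confidence: int


def post_process(elements, is_valid, stop_at_first_invalid=True):
    """Test elements from the end toward the beginning, then reorder.

    stop_at_first_invalid=True corresponds to the preferred variant in
    which the traversal stops at the first invalid element.
    """
    valid = []
    for element in sorted(elements, key=lambda e: e.start_ms, reverse=True):
        if is_valid(element):
            valid.append(element)
        elif stop_at_first_invalid:
            break
    # Use the time information to restore increasing chronological order.
    return sorted(valid, key=lambda e: e.start_ms)


result = [
    Element("five", 0, 300, 5200),
    Element("four", 400, 700, 4800),
    Element("two", 800, 1700, 2500),  # hesitation heard as "two"
    Element("four", 1900, 2200, 6100),
    Element("five", 2300, 2600, 6400),
    Element("three", 2700, 3000, 6000),
    Element("one", 3100, 3400, 6600),
]


def confident(element):
    return element.confidence >= 4000  # assumed validation test


words_stop = [e.word for e in post_process(result, confident)]
words_all = [e.word for e in post_process(result, confident, stop_at_first_invalid=False)]
print(words_stop)
print(words_all)
```

With early stopping, only the consecutive valid run at the end of the result is kept; without it, every valid element is kept and only the low-confidence "two" is dropped.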
[0043] Preferably, the method of the invention comprises a step of verifying that the post-processed solution 200 satisfies a grammar rule. An example of a grammar rule is a number of words. If the post-processed solution 200 does not satisfy such a grammar rule, it may be decided not to provide it. In this case, it is sometimes preferred to provide the result 100 of the voice recognition engine 10. If the post-processed solution 200 satisfies such a grammar rule, then it will be preferred to provide it.
FIG. 3 presents in schematic form a preferred version of the method of the invention in which:
- no additional element 113 is isolated (or chosen) for the validation test once an invalid element 113 has been detected,
- it is verified that the post-processed solution 200 satisfies a grammar rule (step vi.),
- the post-processed solution 200 is provided if it satisfies said grammar rule, and
- the result 100 of the voice recognition engine 10 is provided if the post-processed solution 200 does not satisfy said grammar rule.
Step iii.a consists in determining whether an element 113 selected in step ii. is valid using a validation test. The latter can take many forms.
An element 113 is characterized by a beginning and an end. It therefore has a certain duration 150. According to one possible variant, the validation test comprises a step of considering an element 113 valid if its duration 150 is greater than or equal to a lower duration threshold. The lower duration threshold is for example between 50 and 160 milliseconds. Preferably, the lower duration threshold is 120 milliseconds. The lower duration threshold can be adapted dynamically. According to another possible variant, the validation test comprises a step of considering an element 113 valid if its duration 150 is less than or equal to an upper duration threshold. The upper duration threshold is for example between 400 and 800 milliseconds. Preferably, the upper duration threshold is 600 milliseconds. The upper duration threshold can be adapted dynamically. Preferably, the lower duration threshold and/or the upper duration threshold is/are determined by a grammar.
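A duration-based validation test could be written as in the sketch below. The function name `duration_is_valid` is hypothetical, and combining the lower and upper thresholds in a single test is an assumption; the text presents them as separate variants. The default values of 120 ms and 600 ms are the preferred values given above.

```python
def duration_is_valid(start_ms, end_ms, min_ms=120, max_ms=600):
    """Accept an element whose duration lies within both thresholds."""
    duration = end_ms - start_ms
    return min_ms <= duration <= max_ms


print(duration_is_valid(0, 80))    # short parasitic noise
print(duration_is_valid(0, 300))   # typical word
print(duration_is_valid(0, 900))   # long hesitation
```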
In general, a confidence level 160 is associated with each element 113. According to another possible variant, the validation test comprises a step of considering an element 113 valid if its confidence level 160 is greater than or equal to a minimum confidence level 161. The minimum confidence level 161 may preferably vary dynamically. In such a case, it is then possible that the minimum confidence level 161 used to determine whether an element 113 is valid is different from that used to determine whether another element 113 is valid or not. The inventors have found that a minimum confidence level 161 between 3500 and 5000 provides good results, a more preferred value being 4000 (values for the VoCon® 3200 V3.14 model of Nuance, but which can be transposed to other voice recognition engine models).
According to another possible variant, the validation test comprises a step of considering an element 113 valid if a time interval 170 separating it from another element 113 directly adjacent towards the end 112 of the result 100 is greater than or equal to a minimum time interval. Such a minimum time interval is for example between zero and 50 milliseconds. According to another possible variant, the validation test comprises a step of considering an element 113 valid if a time interval 170 separating it from another element 113 directly adjacent towards the end 112 of the result 100 is less than or equal to a maximum time interval. Such a maximum time interval is for example between 300 and 600 milliseconds, a preferred value being 400 milliseconds. For these two examples of validation test, the time interval 170 considered is therefore the one that separates an element 113 from its direct neighbor to the right in FIG. 2, that is to say its posterior neighbor along the time scale t. A time interval separating two elements 113 is for example a time interval during which a voice recognition engine 10 does not recognize any element 113, for example no word.
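The interval-based test can be illustrated with the following sketch, which represents each element by a hypothetical `(start_ms, end_ms)` pair listed in chronological order. Treating the last element, which has no posterior neighbor, as passing this test is an assumption, as is applying the minimum and maximum intervals together.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # hypothetical (start_ms, end_ms) of one element


def gap_to_next_ms(spans: List[Span], index: int) -> Optional[float]:
    """Time interval 170 between an element and its neighbor towards the end 112."""
    if index + 1 >= len(spans):
        return None  # the last element has no posterior neighbor
    return spans[index + 1][0] - spans[index][1]


def gap_is_valid(
    spans: List[Span],
    index: int,
    min_gap_ms: float = 0.0,
    max_gap_ms: float = 400.0,
) -> bool:
    """Check the interval after element `index` against both interval bounds."""
    gap = gap_to_next_ms(spans, index)
    if gap is None:
        return True  # assumption: nothing to compare against, so do not reject
    return min_gap_ms <= gap <= max_gap_ms
```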
According to another possible variant, the validation test is adapted to the speaker 40 (or user) who recorded the message 50. Each person pronounces elements 113 or words in a particular way. For example, some people say words slowly, while others pronounce them quickly. Likewise, the confidence level 160 that a speech recognition engine 10 assigns to a word generally depends on the speaker 40 who uttered the word. If one or more statistics associated with different elements 113 are known for a given speaker 40, they can be used during the validation test of step iii.a. to determine whether an element 113 is valid or not. For example, an element 113 spoken by a given speaker 40 can be considered valid if one or more statistics associated with this element 113 conform, within a margin of error (10% for example), to the same statistics pre-established for the same element 113 and the same speaker 40. This preferred variant of the validation test requires knowing the identity of the speaker 40. It can be provided for example by the voice recognition engine 10. According to another possibility, the post-processing method of the invention comprises a step of identifying the speaker 40.
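A speaker-adapted test of this kind could be sketched as below. The `Profile` mapping, the choice of duration as the statistic, the speaker identifier and the behavior when no statistic is known are all assumptions introduced for illustration; only the 10% margin comes from the description.

```python
from typing import Dict, Tuple

# Hypothetical pre-established statistics:
# (speaker id, element text) -> expected duration in milliseconds.
Profile = Dict[Tuple[str, str], float]


def matches_speaker_profile(
    profile: Profile,
    speaker: str,
    element: str,
    observed_ms: float,
    tolerance: float = 0.10,
) -> bool:
    """Check an observed statistic against the one stored for this speaker."""
    expected_ms = profile.get((speaker, element))
    if expected_ms is None:
        return True  # assumption: without a stored statistic, do not reject
    # Relative margin of error around the pre-established value.
    return abs(observed_ms - expected_ms) <= tolerance * expected_ms
```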
In Figure 2, elements 113 considered valid are delimited by continuous lines, while elements 113 considered invalid are delimited by dashed lines. The fourth element 113 starting from the end 112 is for example considered invalid because its duration 150 is smaller than a lower duration threshold. The fifth element 113 starting from the end 112 is for example considered invalid because its confidence level 160 is lower than a minimum confidence level 161.
The inventors also propose a method for generating an optimized solution from a first and a second voice recognition result 100, comprising the following steps: A. applying a post-processing method according to the first aspect of the invention to said first result 100; B. applying a post-processing method according to the first aspect of the invention to said second result 100; C. determining said optimized solution from one or more elements 113 belonging to one or more of said first and second results 100 and which have been determined to be valid by the validation test of step iii.a.
[0052] According to a second aspect, the invention relates to a post-processing system 11 (or post-processing device) for a speech recognition result 100. Figure 4 schematically illustrates such a post-processing system 11 in combination with a speech recognition engine 10 and a screen 20. In this figure, the post-processing system 11 and the speech recognition engine 10 are two separate devices. According to another possible version, the post-processing system 11 is integrated into a voice recognition engine 10 so that the two cannot be told apart. In such a case, a conventional speech recognition engine 10 is modified or adapted to perform the functions of the post-processing system 11 described below.
Examples of a post-processing system 11 are: a computer, a speech recognition engine 10 adapted or programmed to perform a post-processing method according to the first aspect of the invention, a hardware module of a voice recognition engine 10, a hardware module adapted to communicate with a voice recognition engine 10. Other examples are nevertheless possible. The post-processing system 11 comprises acquisition means 12 for receiving and reading a speech recognition result 100. Examples of acquisition means 12 are: an input port of the post-processing system 11, for example a USB port, an Ethernet port, a wireless port (for example Wi-Fi). Other examples of acquisition means 12 are nevertheless possible. The post-processing system 11 further comprises processing means 13 for performing the following steps recursively: isolating, from the end 112 towards the start 111 of the result 100, an element 113 of the result 100 which has not previously undergone a validation test of the processing means 13, and determining whether it is valid using a validation test; and for determining a post-processed solution 200 by taking up at least one element 113 determined valid by said processing means 13. Preferably, said processing means 13 determine a post-processed solution 200 by taking up all the elements 113 determined valid by said processing means 13. Preferably, the post-processing system 11 is able to send the post-processed solution 200 to a screen 20 to display it.
Examples of processing means 13 are: a control unit, a processor or central processing unit, a controller, a chip, a microchip, an integrated circuit, a multi-core processor. Other examples known to those skilled in the art are nevertheless possible. According to one possible version, the processing means 13 comprise different units for carrying out the various steps mentioned above in connection with these processing means 13 (isolating an element 113, determining whether it is valid, determining a post-processed solution 200).
In a third aspect, the invention relates to a program, preferably a computer program. Preferably, this program is part of a human-machine voice interface.
According to a fourth aspect, the invention relates to a storage medium that can be connected to a device, for example a computer that can communicate with a voice recognition engine 10. According to another possible variant, this device is a voice recognition engine 10. Examples of storage media according to the invention are: a USB key, an external hard disk, a CD-ROM type disk. Other examples are nevertheless possible.
The present invention has been described in connection with specific embodiments, which have a purely illustrative value and should not be considered as limiting. In general, the present invention is not limited to the examples illustrated and/or described above. The use of the verbs "to comprise", "to include", "to contain", or any other variant, as well as their conjugations, can in no way exclude the presence of elements other than those mentioned. The use of the indefinite article "a" or "an", or of the definite article "the", to introduce an element does not exclude the presence of a plurality of these elements. The reference numerals in the claims do not limit their scope.
In summary, the invention can also be described as follows. A method of post-processing a voice recognition result 100, said result 100 comprising a start 111, an end 112 and a plurality of elements 113, said method comprising the steps of: reading said result 100; choosing one of its elements 113; determining whether it is valid; repeating the steps of choosing an element 113 and determining whether it is valid or not; if at least one element 113 has been determined valid, determining a post-processed solution 200 by taking up at least one element 113 determined valid. The method of the invention is characterized in that each element 113 is chosen consecutively from said end 112 towards said start 111 of the result 100.

Claims:
Claims (15)
[1]
1. A method of post-processing a voice recognition result (100), said result (100) comprising a start (111), an end (112) and a plurality of elements (113) distributed between said start (111) and said end (112), said post-processing method comprising the following steps: i. receiving said result (100); ii. isolating an element (113) from said plurality of elements (113) that has not yet undergone the validation test of step iii.a.; iii. a. if an element (113) has been isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.; iv. repeating steps ii. and iii.; v. if at least one element (113) has been determined valid in step iii.a., determining a post-processed solution (200) using at least one element (113) determined valid in step iii.a.; characterized in that each element (113) isolated in step ii. is selected consecutively from said end (112) of the result (100) towards said start (111) of the result (100).
[2]
2. Method according to claim 1 characterized in that said elements (113) are words.
[3]
3. Method according to claim 1 or 2 characterized in that step iii.a. further comprises an instruction to go directly to step v. if the element (113) undergoing the validation test of step iii.a is not determined to be valid.
[4]
4. Method according to any one of the preceding claims characterized in that it further comprises the following step: vi. determining whether said post-processed solution (200) of step v. satisfies a grammar rule.
[5]
5. Method according to the preceding claim characterized in that it further comprises the following step: vii. a. if the answer to the test of step vi. is positive, providing said post-processed solution (200), b. otherwise, providing said voice recognition result (100).
[6]
6. Method according to any one of the preceding claims, characterized in that said validation test of step iii.a. includes a step of considering an element (113) valid if its duration is greater than or equal to a lower duration threshold.
[7]
7. Method according to any one of the preceding claims, characterized in that said validation test of step iii.a. comprises a step of considering an element (113) valid if its duration is less than or equal to an upper duration threshold.
[8]
8. Method according to any one of the preceding claims, characterized in that each element (113) of said result (100) is characterized by a confidence level (160) and in that said validation test of step iii.a. includes a step of considering an element (113) valid if its confidence level (160) is greater than or equal to a minimum confidence level (161).
[9]
9. Method according to any one of the preceding claims, characterized in that said validation test of step iii.a. includes a step of considering an element (113) valid if a time interval (170) separating it from another element (113) directly adjacent towards said end (112) of the result (100) is greater than or equal to a minimum time interval.
[10]
10. Method according to any one of the preceding claims, characterized in that said validation test of step iii.a. comprises a step of considering, for a given speaker (40), an element (113) of said result (100) valid if a statistic associated with that element (113) conforms, within a given range, to a predetermined statistic for the same element (113) and for that given speaker (40).
[11]
11. Method according to any one of the preceding claims, characterized in that all the elements (113) determined valid in step iii.a. are taken up to determine said post-processed solution (200) in step v.
[12]
12. A method for determining an optimized solution from first and second voice recognition results (100), comprising the steps of: A. applying a post-processing method according to any one of the preceding claims to said first result (100); B. applying a post-processing method according to any one of the preceding claims to said second result (100); C. determining said optimized solution from one or more elements (113) belonging to one or more of said first and second results (100) and which have been determined valid by the validation test of step iii.a.
[13]
13. A post-processing system (11) for a voice recognition result (100), said result (100) comprising a start (111), an end (112) and a plurality of elements (113) distributed between said start (111) and said end (112), said post-processing system (11) comprising: - acquisition means (12) for reading said result (100); - processing means (13): + for performing the following steps recursively: • isolating an element (113) from said plurality of elements (113) which has not previously undergone a validation test imposed by said processing means (13), • determining whether the isolated element (113) is valid, using a validation test, and + for determining a post-processed solution (200) by taking up at least one element (113) determined valid; characterized in that each element (113) isolated by the processing means (13) is selected consecutively from said end (112) of the result (100) towards said start (111) of the result (100).
[14]
14. A program for processing a voice recognition result (100), said result (100) comprising a start (111), an end (112) and a plurality of elements (113) distributed between said start (111) and said end (112), said program comprising code to enable a device to perform the following steps: i. reading said voice recognition result (100); ii. isolating an element (113) from said plurality of elements (113) that has not yet undergone the validation test of step iii.a.; iii. a. if an element (113) has been isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.; iv. repeating steps ii. and iii.; v. if at least one element (113) has been determined valid in step iii.a., determining a post-processed solution (200) by taking up at least one element (113) determined valid in step iii.a.; characterized in that each element (113) isolated in step ii. is selected consecutively from said end (112) of the result (100) towards said start (111) of the result (100).
[15]
15. A storage medium connectable to a device and comprising instructions which, when read, enable said device to process a voice recognition result (100), said result (100) comprising a start (111), an end (112) and a plurality of elements (113) distributed between said start (111) and said end (112), said instructions causing said device to perform the following steps: i. reading said result (100); ii. isolating an element (113) from said plurality of elements (113) that has not yet undergone the validation test of step iii.a.; iii. a. if an element (113) has been isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.; iv. repeating steps ii. and iii.; v. if at least one element (113) has been determined valid in step iii.a., determining a post-processed solution (200) by taking up at least one element (113) determined valid in step iii.a.; characterized in that each element (113) isolated in step ii. is selected consecutively from said end (112) of the result (100) towards said start (111) of the result (100).

Similar technologies:

Publication number | Publication date | Title

EP1362343B1|2007-08-29|Method, module, device and server for voice recognition

US9405741B1|2016-08-02|Controlling offensive content in output

EP0867856A1|1998-09-30|Method and apparatus for vocal activity detection

EP1154405A1|2001-11-14|Method and device for speech recognition in surroundings with varying noise levels

EP1585110A1|2005-10-12|System for speech controlled applications

CN108039181B|2021-02-12|Method and device for analyzing emotion information of sound signal

Triantafyllopoulos et al.2019|Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement.

EP1647897A1|2006-04-19|Automatic generation of correction rules for concept sequences

CN109065026A|2018-12-21|A kind of recording control method and device

BE1023435B1|2017-03-20|Method and system for post-processing a speech recognition result

BE1023458B1|2017-03-27|Method and system for generating an optimized voice recognition solution

CN109448746B|2020-03-24|Voice noise reduction method and device

BE1023427B1|2017-03-16|Method and system for determining the validity of an element of a speech recognition result

FR2769117A1|1999-04-02|LEARNING PROCESS IN A SPEECH RECOGNITION SYSTEM

EP1285435B1|2007-03-21|Syntactic and semantic analysis of voice commands

EP3627510A1|2020-03-25|Filtering of an audio signal acquired by a voice recognition system

WO2018077987A1|2018-05-03|Method of processing audio data from a vocal exchange, corresponding system and computer program

FR2867583A1|2005-09-16|Semantic, syntax and lexical electronic proof reader for e.g. dyslexic person, has vocal interaction module to select expression matching most phonetically with dictated expression automatically and replace wrong expression in digital text

Sipavičius et al.2016|“Google” Lithuanian Speech Recognition Efficiency Evaluation Research

EP1665231B1|2008-03-05|Method for unsupervised doping and rejection of words not in a vocabulary in vocal recognition

US20210306457A1|2021-09-30|Method and apparatus for behavioral analysis of a conversation

FR3111004A1|2021-12-03|Method of identifying a speaker

EP3195314A1|2017-07-26|Methods and apparatus for unsupervised wakeup

FR3105499A1|2021-06-25|Method and device for visual animation of a voice control interface of a virtual personal assistant on board a motor vehicle, and a motor vehicle incorporating it

FR2988894A1|2013-10-04|Method for detection of voice to detect presence of word signals in disturbed signal output from microphone, involves comparing detection function with phi threshold for detecting presence of absence of fundamental frequency

Patent family:

Publication number | Publication date

BE1023435A1|2017-03-20|

JP6768715B2|2020-10-14|

ES2811771T3|2021-03-15|

WO2016142235A1|2016-09-15|

EP3065131A1|2016-09-07|

PT3065131T|2020-08-27|

US20180151175A1|2018-05-31|

JP2018507446A|2018-03-15|

CN107750378A|2018-03-02|

EP3065131B1|2020-05-20|

PL3065131T3|2021-01-25|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Title

US7181399B1|1999-05-19|2007-02-20|At&T Corp.|Recognizing the numeric language in natural spoken dialogue|

US20020133346A1|2001-03-16|2002-09-19|International Business Machines Corporation|Method for processing initially recognized speech in a speech recognition session|

US20050209849A1|2004-03-22|2005-09-22|Sony Corporation And Sony Electronics Inc.|System and method for automatically cataloguing data by utilizing speech recognition procedures|

US20070050190A1|2005-08-24|2007-03-01|Fujitsu Limited|Voice recognition system and voice processing system|

US20140249817A1|2013-03-04|2014-09-04|Rawles Llc|Identification using Audio Signatures and Additional Characteristics|

US5745602A|1995-05-01|1998-04-28|Xerox Corporation|Automatic method of selecting multi-word key phrases from a document|

US20060074664A1|2000-01-10|2006-04-06|Lam Kwok L|System and method for utterance verification of chinese long and short keywords|

AU5944601A|2000-05-02|2001-11-12|Dragon Systems Inc|Error correction in speech recognition|

US6754629B1|2000-09-08|2004-06-22|Qualcomm Incorporated|System and method for automatic voice recognition using mapping|

JP2004101963A|2002-09-10|2004-04-02|Advanced Telecommunication Research Institute International|Method for correcting speech recognition result and computer program for correcting speech recognition result|

JP5072415B2|2007-04-10|2012-11-14|三菱電機株式会社|Voice search device|

US8781825B2|2011-08-24|2014-07-15|Sensory, Incorporated|Reducing false positives in speech recognition systems|

US20140278418A1|2013-03-15|2014-09-18|Broadcom Corporation|Speaker-identification-assisted downlink speech processing systems and methods|

Legal status:

Priority:

Application number | Filing date | Title

EP15157919.0|2015-03-06|

EP15157919.0A|EP3065131B1|2015-03-06|2015-03-06|Method and system for post-processing a speech recognition result|
